Method for generating a coherent representation for at least two log files

ABSTRACT

Provided is a computer-implemented method for generating a coherent representation for at least two log files, comprising: receiving the at least two log files, wherein each log file of the at least two log files includes at least one log entry with at least one time stamp and at least one message, and wherein the at least two log files differ from one another with respect to at least one distinctive criterion; extracting at least one piece of additional information from each log file of the at least two log files; and combining each log file of the at least two log files with the extracted additional information into at least two processed log files, wherein the at least two processed log files comply with a coherent representation. Further, the invention relates to a corresponding computer program product and generating unit.

FIELD OF TECHNOLOGY

The following relates to a computer-implemented method for generating a coherent representation for at least two log files. Further, the invention relates to a corresponding computer program product and generating unit.

BACKGROUND

The amount of data, or data volume, continues to increase. The data can include human- and machine-generated data. Such large or voluminous data is known under the terms “big data” or “large-scale data”. In particular, the amount of digital data is expected to grow substantially in the coming years in view of the digital transformation and Industry 4.0.

Thus, automated large-scale data analysis or data processing will gain in importance, since manual analysis becomes infeasible for experts. This analysis or processing paradigm encompasses a series of different methods and systems to process big data. Big data challenges include in particular capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.

Complex industrial plants usually comprise distinct parts, modules or units with a multiplicity of individual functions. Exemplary units include sensors and actuators. The units and functions have to be controlled and regulated in an interacting manner. They are often monitored, controlled and regulated by automation systems, for example the Simatic S7 system of Siemens AG. The units can either exchange data directly with one another or communicate via a bus system with one another and with a master control unit, if the plant has such a unit. The units are connected to the bus system via parallel or, more often, serial interfaces.

A large number of log files is generated during operation of such industrial plants. Each log file comprises one or more log entries and has a different structure or format depending on the computing unit, program or process that generated it. Log mining tasks struggle with the variety of log file structures, formats and types that can be found in heterogeneous computer systems, such as the aforementioned industrial plants. Exemplary tasks include the identification of anomalies in the log entries, comparison of the log files from one industrial plant over time, extraction of log files and/or extraction of relevant information from the log files of different industrial plants.

According to the prior art, users or experts have to manually analyze the huge number of log files and extract the relevant information from them. However, such manual approaches rely on expert knowledge and require a lot of manual effort. Thus, they are error-prone, time-consuming and expensive.

According to the prior art, besides the manual approaches, the information extraction can be accomplished automatically with regular expressions. However, the patterns have to be defined and tested by an expert based on expert knowledge. A disadvantage is that the definition, testing and pattern matching are error-prone and time-consuming.

An aspect relates to providing a computer-implemented method for generating a coherent representation for at least two log files in an efficient and reliable manner.

SUMMARY

According to one aspect of the invention, this problem is solved by a computer-implemented method for generating a coherent representation for at least two log files, comprising the steps:

-   a. Receiving the at least two log files; wherein
-   b. each log file of the at least two log files comprises at least one log entry with at least one time stamp and at least one message; wherein
-   c. the at least two log files differ from one another with respect to at least one distinctive criterion;
-   d. Extracting at least one piece of additional information from each log file of the at least two log files; and
-   e. Combining each log file of the at least two log files with the extracted additional information into at least two processed log files; wherein
-   f. the at least two processed log files comply with a coherent representation.

Accordingly, the invention is directed to a computer-implemented method for generating a coherent representation for at least two log files. In other words, the processed log files comply with, or are in accordance with, a coherent representation, which can be directly used as input for further method steps or applications, e.g. log mining tasks. Log mining tasks are directed to the aforementioned analysis of log files.

In a first step, the log files are provided as input. During operation, a computing unit or technical system generates a huge number of log files, as explained further above. In most cases, the log files are of different formats or types. In other words, according to this example, the distinctive criterion is the format or the type. For example, the log entry structure can vary between different types of log files, i.e. those produced or generated by different programs or computing units.

Each log file of the plurality of log files comprises at least a time stamp and a message. Furthermore, each log file can comprise additional elements or information, including an internal structure, a message code and indicators of the computing unit, technical system, subsystem or component where it was generated. In this example, the additional element or information thus gives an indication of the origin of the log file.

In further steps, this additional information is extracted from the diverse log files and incorporated into processed log files. The term extracting can be equally referred to as parsing. In other words, the log files are extended with the additional information. The incorporation or extension allows understanding the log files not only in terms of their content, but also in terms of their origin and other important data.

The processed log files are in accordance with a coherent representation. The coherent representation allows the consideration of diverse types of log files from different origins and with varying structural characteristics.
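By way of illustration only, the combining step may be sketched as follows. The record type `ProcessedEntry`, the function `combine` and the keys `computing_unit` and `program` are assumptions introduced for this sketch and are not prescribed by the method; the unit and program names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessedEntry:
    """One log entry in the shared (coherent) representation."""
    timestamp: str                       # time stamp of the entry
    message: str                         # message text of the entry
    additional: Dict[str, str] = field(default_factory=dict)  # extracted extras


def combine(entries: List[Dict[str, str]],
            additional: Dict[str, str]) -> List[ProcessedEntry]:
    """Attach the extracted additional information to every entry of one log file."""
    return [
        ProcessedEntry(e["timestamp"], e["message"], dict(additional))
        for e in entries
    ]


# Two log files of different origin end up in the same record shape.
file_a = [{"timestamp": "2020-01-01T10:00:00", "message": "Unable to open file /tmp/a"}]
file_b = [{"timestamp": "2020-01-01T10:00:05", "message": "Connection lost"}]

processed = (
    combine(file_a, {"computing_unit": "ComputerY", "program": "programZ"})
    + combine(file_b, {"computing_unit": "ComputerW", "program": "programV"})
)
```

Because every record has the same fields, a downstream log mining step could iterate over `processed` without knowing which program wrote which file.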

In one aspect of the invention the at least one distinctive criterion is selected from the group comprising type, format and structure. A log file can have one or more log entries. In some types of log files, a log entry is exactly one line. In other types, a log entry comprises multiple lines. Moreover, separators between log entries or between different parts of a log message of a log entry can differ from program to program. Time stamps can have different formats in different log files. Part of the time stamp, e.g. the date, can be included in the log file name or in one of the header lines, while the remainder, e.g. the time, is recorded for each log entry. The advantage is that the parsing or extracting step can be flexibly applied to diverse log files irrespective of any such differences.
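The structural differences named above may, purely as an example, be handled as sketched below. The two time stamp layouts, the date pattern in the file name and the function `parse_lines` are assumptions chosen for this sketch; real log files may require further patterns.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical time stamp layouts; real log files may need more patterns.
FULL_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\s*")
TIME_ONLY = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*")
DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})")


def parse_lines(lines: List[str], file_name: str) -> List[Tuple[datetime, str]]:
    """Group lines into entries; a line without a time stamp continues the previous entry."""
    name_match = DATE_IN_NAME.search(file_name)
    name_date: Optional[str] = name_match.group(1) if name_match else None
    entries: List[Tuple[datetime, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        full = FULL_TS.match(line)
        short = TIME_ONLY.match(line) if name_date else None
        if full:
            stamp = f"{full.group(1)} {full.group(2)}"
            rest = line[full.end():]
        elif short:
            stamp = f"{name_date} {short.group(1)}"  # date comes from the file name
            rest = line[short.end():]
        elif entries:
            ts, msg = entries[-1]
            entries[-1] = (ts, msg + "\n" + line)    # continuation line
            continue
        else:
            continue                                 # header line before the first entry
        entries.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), rest))
    return entries


single = parse_lines(["2020-03-01 08:00:00 Started", "2020-03-01 08:00:02 Ready"], "app.log")
multi = parse_lines(["[09:15:00] Error in module", "  at step 3", "[09:15:01] Recovered"],
                    "service_2020-03-02.log")
```

Both calls return the same tuple shape, although the first file uses one line per entry with a full time stamp and the second uses multi-line entries whose date is only present in the file name.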

In one aspect of the invention the additional information is information selected from the group comprising: a computing unit which generated the log file, a program which generated the log file, configuration information of the computing unit which generated the log file, a log entry template and a connection between a log entry and the computing unit the log entry references. Accordingly, any additional auxiliary information can be incorporated.

Log Entry Template:

Usually, log entries are instances of a log entry template. This means that the message of the log entry consists of two parts: a fixed text and dynamically generated values. For example, the log entry template can be expressed as “Unable to open file %s”, where “Unable to open file” is the fixed part and “%s” is the variable part. The actual instances have specific file paths in the message text.
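As a minimal sketch, and assuming the printf-style placeholder `%s` is the only kind of variable part, checking whether a message is an instance of a template could look as follows. The helper names `template_to_regex` and `match_template` and the example file path are hypothetical.

```python
import re
from typing import List, Optional


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn a printf-style template into a regex: fixed text is escaped, %s captures a value."""
    parts = template.split("%s")
    pattern = "(.+?)".join(re.escape(p) for p in parts)
    return re.compile("^" + pattern + "$")


def match_template(template: str, message: str) -> Optional[List[str]]:
    """Return the variable values if the message is an instance of the template, else None."""
    m = template_to_regex(template).match(message)
    return list(m.groups()) if m else None


params = match_template("Unable to open file %s",
                        "Unable to open file /var/data/input.csv")
no_match = match_template("Unable to open file %s", "Connection lost")
```

Here `params` holds the file path taken from the variable part, while `no_match` is `None` because the fixed part differs.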

The advantage of this additional or auxiliary information is that the information content of the log files is significantly increased.

In another aspect of the invention the coherent representation is an input for log mining or any other analysis.

In a further aspect of the invention the method comprises the further step of loading the coherent representation into a knowledge graph.

Accordingly, the output or result of the method, in the form of the coherent representation, can be used for distinct tasks. The knowledge graph is important for the diagnosis and repair of problems in an industrial environment, e.g. industrial plants. In other words, the method allows the transformation of a set or collection of diverse log files from computing units or systems into a knowledge graph. Thus, problems such as defects or failures of industrial plants can be handled in an efficient and timely manner.

A further aspect of the invention is a computer program product directly loadable into an internal memory of a computer, comprising software code portions for performing the steps according to the aforementioned method when said computer program product is running on a computer.

A further aspect of the invention is a generating unit for performing the aforementioned method.

The unit may be realized as any device, or any means, for computing, in particular for executing software, an app, or an algorithm. For example, the generating unit may consist of or comprise a central processing unit (CPU) and/or a memory operatively connected to the CPU. The unit may also comprise an array of CPUs, an array of graphics processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing. The unit may comprise at least one module which in turn may comprise software and/or hardware. Some, or even all, modules of the unit may be implemented by a cloud computing platform.

BRIEF DESCRIPTION

Some of the embodiments will be described in detail, with reference to the following figures, wherein like designations denote like members, wherein:

FIG. 1 illustrates a flowchart of the method according to the invention;

FIG. 2 illustrates an exemplary knowledge graph according to an embodiment of the invention;

FIG. 3 illustrates distinct log files according to an embodiment of the invention;

FIG. 4 illustrates distinct configuration files according to an embodiment of the invention; and

FIG. 5 illustrates an exemplary use case of the method according to the invention.

DETAILED DESCRIPTION

FIG. 1 illustrates a flowchart of the method according to the invention with the method steps S1 to S3. The method steps S1 to S3 will be explained in more detail in the following.

In a first step, the at least two log files are received S1, wherein each log file of the at least two log files comprises at least one log entry 10 with at least one time stamp 12 and at least one message 14, wherein the at least two log files differ from one another with respect to at least one distinctive criterion. These log files are depicted in FIG. 3.

In a second step, at least one piece of additional information is extracted from each log file of the at least two log files S2.

In a third step, each log file of the at least two log files is combined with the extracted additional information into at least two processed log files S3, wherein the at least two processed log files comply with a coherent representation.

The method according to the invention results in the coherent representation, which can be directly loaded into a knowledge graph. The method can be performed by the generating unit. The generating unit can be equally referred to as a universal parser or universal parsing unit.

Additional or Auxiliary Information

-   a computing unit which generated the log file
    -   The information about the computing unit that generated the log file can be collected.
-   a program which generated the log file
    -   The information about the program that generated the log file can be collected; in particular, the name of the program that generated the log file can be extracted.
    -   Log files generated by different computing units, programs or processes can end up in different locations, i.e. along different file paths. The file paths can contain the additional information about which computing units, programs or processes generated which log files. The parsing algorithm is represented by the following exemplary pseudo code:

    PARSELOGFILE(filepath):
      logFile = openRead(filepath)
      logEntries = [ ]
      buffer = ""
      ts = ""
      while NOT logFile.endOfFile( )
        line = logFile.readLine( )
        newTs = findTimestamp(line)
        if newTs == 0:
          // no time stamp: the line continues the current log entry
          buffer += line
        else:
          // a time stamp starts a new log entry; store the previous one
          if ts != "": logEntries.append(buffer, ts)
          ts = newTs
          buffer = splitString(line, ts)
      endwhile
      // store the last log entry
      if ts != "": logEntries.append(buffer, ts)
      logFile.close( )
      return logEntries

    function splitString(l, ts)
      // return the part of the line that follows the time stamp
      pos = l.find(ts) + ts.length
      return l[pos:]
    endfunction

    function findTimestamp(l)
      // tsRegExList: set of regular expressions specifying
      // different formats of time stamps
      for regEx in tsRegExList
        if regEx.match(l):
          return regEx.match(l)
      // no format matched
      return 0
    endfunction

    -   Accordingly, the paths of the log files can be extracted to identify the computing unit, program or process that generated the respective log file. Different programs tend to write their log files into separate locations, and data from different computing units is likely to be dumped separately. Thus, the specific log entries can be associated with the respective computing unit, program or process.
-   configuration information of the computing unit which generated the log file
    -   The device configuration information can be collected, e.g. values of configuration settings in the log entries. Further, certain log files can be linked to the computing unit, program or process that generated them.

For example, the configuration information or file of a program might specify where the log files will be written or set flags for certain behaviors. These configuration files are depicted in FIG. 4.
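The path-based identification of the origin described above may be sketched as follows. The directory layout `<plant>/<computer>/<program>/logs/<file>`, the function `origin_from_path` and the returned keys are assumptions made for this sketch; an actual dump may be organized differently.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def origin_from_path(file_path: str) -> Optional[Dict[str, str]]:
    """Read plant, computing unit and program from an assumed
    <plant>/<computer>/<program>/logs/<file> layout."""
    parts = PurePosixPath(file_path).parts
    if len(parts) < 5 or parts[-2] != "logs":
        return None  # layout not recognized; leave the origin unknown
    plant, computer, program = parts[-5], parts[-4], parts[-3]
    return {"plant": plant, "computing_unit": computer, "program": program}


origin = origin_from_path("PlantX/ComputerY/programZ/logs/app.log")
unknown = origin_from_path("misc/app.log")
```

Returning `None` for an unrecognized layout keeps the sketch from guessing an origin that the path does not actually encode.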

-   a log entry template
    -   The templates of the underlying structure of the log entry messages can be collected.
    -   Accordingly, log files from large distributed systems can reflect the system structure:
    -   There can be multiple computing units of different types or fulfilling different roles, e.g. servers and clients or embedded systems, that run the same or similar software programs. Thus, the log file dumps from each such computing unit contain the same or similar types of log files. Further, computing units generating different types of log entries likely have different functions.
    -   Moreover, the log entry messages can comprise information about the network organization, e.g. by mentioning names or IP addresses of different computers.
    -   An exemplary log file dump or snapshot can be expressed as follows:
    -   PlantX/ComputerY/file_path_for_programZ/logs (or settings/config files)
    -   The log entry template can be determined by clustering or grouping the message texts and identifying the invariant parts. The variable parts are the template parameters, and the messages with the same fixed parts are generated from the same template.
    -   Having identified the log entry templates, the multi-language versions of the same template can be identified as well, since they are generated by the same computing unit, program or process and thus have the same number of parameters. This semantic verification can be performed manually or automatically with automated translators.
-   a connection between a log entry and the computing unit the log entry references

The interconnections between log entries and the computing units or devices they reference can be collected as well. Accordingly, the log entry messages can be used to identify and cross-reference computer names and IP addresses.
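One possible way to find such references, given only as a sketch, is shown below. It assumes that a mapping from computing unit names to IPv4 addresses is already known, for example from configuration files; the mapping `units`, the function `find_references` and the addresses are hypothetical.

```python
import re
from typing import Dict, Set

# Simple IPv4 pattern; it does not validate the numeric range of each octet.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_references(message: str, known_units: Dict[str, str]) -> Set[str]:
    """Return the names of computing units referenced by name or IP address in a message.

    known_units maps a unit name to its IP address (both hypothetical here).
    """
    by_ip = {ip: name for name, ip in known_units.items()}
    referenced = {by_ip[ip] for ip in IPV4.findall(message) if ip in by_ip}
    for name in known_units:
        if re.search(r"\b" + re.escape(name) + r"\b", message):
            referenced.add(name)
    return referenced


units = {"ComputerY": "10.0.0.5", "ComputerW": "10.0.0.7"}
refs = find_references("Lost connection to 10.0.0.7, retrying via ComputerY", units)
```

In this example the message mentions one unit by address and one by name, so both end up in `refs`.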

Knowledge Graph

The output can be loaded into a knowledge graph, as explained further above. An exemplary knowledge graph is shown in FIG. 2, comprising the following entities and relations:

-   A plant consists of multiple devices
-   Some devices have computers in order to perform computations
-   A process is an instance of a program running on a computer
-   A program can have multiple General Log Templates (GLT)
-   Each GLT has a message template with several parameters
-   A log template is a language-specific version of a GLT
-   A log entry 10 is an instantiation of a log template
-   A log entry 10 has a timestamp 12 (TS)
-   A log entry 10 has a message text 14, i.e. the template with its parameters filled in
-   A log entry is contained in a log file (LF) and produced by a process and is therefore linked to a computing unit
-   A computing unit or computer is referenced by a log entry in a message
-   A configuration file (CF) on a computing unit can have multiple configuration values (CV) affecting the whole computing unit or specific processes
-   A configuration value can be directly referenced by a log entry message or can have indirect relevance
-   A plant can have multiple snapshots generated at different points in time
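A few of the relations listed above may be represented, for illustration only, as subject-relation-object triples. The relation names (`has_timestamp`, `produced_by`, `runs_on` and so on), the identifiers and the helper functions are assumptions of this sketch and are not taken from FIG. 2.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def entry_to_triples(entry_id: str, entry: Dict[str, str]) -> Set[Triple]:
    """Translate one processed log entry into graph edges; relation names are illustrative."""
    return {
        (entry_id, "has_timestamp", entry["timestamp"]),
        (entry_id, "has_message", entry["message"]),
        (entry_id, "contained_in", entry["log_file"]),
        (entry_id, "produced_by", entry["process"]),
        (entry["process"], "instance_of", entry["program"]),
        (entry["process"], "runs_on", entry["computer"]),
    }


def neighbors(graph: Set[Triple], subject: str, relation: str) -> List[str]:
    """Follow one relation from a node, e.g. to find where an entry was produced."""
    return sorted(o for s, r, o in graph if s == subject and r == relation)


graph = entry_to_triples("entry-1", {
    "timestamp": "2020-03-01T08:00:00",
    "message": "Unable to open file /tmp/a.txt",
    "log_file": "app.log",
    "process": "process-42",
    "program": "programZ",
    "computer": "ComputerY",
})

# Walk from the log entry to the process and on to the computer it runs on.
process = neighbors(graph, "entry-1", "produced_by")[0]
computer = neighbors(graph, process, "runs_on")
```

A set of triples is used here only to keep the sketch free of dependencies; a graph database or graph library could hold the same edges.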

Exemplary Applications

At present, most of the operation and control of industrial equipment is managed by standard or special control software. Humans may frequently be engaged in a monitoring capacity, but only get involved in problem situations. However, when such situations arise, it may be nontrivial to identify causes and potential solutions. The main way to gain insight into the operation of such computer-controlled systems is to examine information from the relevant log files. This task is performed manually by experienced service technicians, which makes it time-consuming and not always as accurate as needed.

The knowledge graph provides the users, e.g. experts and service technicians, with an organized view of the log file data.

An exemplary use case is shown in FIG. 5. The log files can be collected from different customer plants with SIMATIC systems. The knowledge extraction process is described by the bottom part of the figure.

In a first step, the log files are clustered. Log messages and time stamps are extracted by generic parsers. The messages can be used to extract templates. Further, the content of the messages can be extracted. All information is inserted into a knowledge graph for further analysis, shown in the right part of the figure, such as anomaly detection, failure prediction and root cause understanding by a combination of statistical and knowledge graph analytics.
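The extraction of templates from messages may be sketched as follows. This is a deliberately simple heuristic and not necessarily the clustering used in practice: it assumes that messages from the same template have the same number of whitespace-separated tokens and the same first token. The function `infer_templates` and the sample messages are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def infer_templates(messages: List[str]) -> List[str]:
    """Group messages by token count and first token, then mark varying positions as %s."""
    clusters: Dict[Tuple[int, str], List[List[str]]] = defaultdict(list)
    for message in messages:
        tokens = message.split()
        if tokens:
            clusters[(len(tokens), tokens[0])].append(tokens)
    templates = []
    for group in clusters.values():
        columns = zip(*group)  # all tokens at position 0, at position 1, ...
        templates.append(" ".join(
            col[0] if len(set(col)) == 1 else "%s" for col in columns
        ))
    return sorted(templates)


messages = [
    "Unable to open file /tmp/a.txt",
    "Unable to open file /var/log/b.txt",
    "Connection to 10.0.0.7 lost",
    "Connection to 10.0.0.9 lost",
]
templates = infer_templates(messages)
```

Positions whose token is identical across a group form the fixed part; positions that vary become template parameters. A group with a single message yields that message unchanged, since no variable part can be observed.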

Considering industrial applications and environments, the data can refer to

-   Power plants. The power plants can have multiple turbines and other pieces of equipment.
-   Modern factories. The factories can have multiple interacting automated tools.
-   Trains. The trains can have multiple semi-autonomous systems, for example for door control, climate control and movement.
-   Medical equipment. The equipment can have separate controllers for operating different movable parts, e.g. the patient bed or the scanning tools, and the devices, e.g. MRI for imaging and data collection.

Reference Signs

S1 to S3 Method steps 1 to 3

10 log entry

12 time stamp (TS) of log entry

14 message of log entry

Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention.

For the sake of clarity, it is to be understood that the use of ‘a’ or ‘an’ throughout this application does not exclude a plurality, and ‘comprising’ does not exclude other steps or elements.

1. A computer-implemented method for generating a coherent representation for at least two log files, comprising the steps: a. Receiving the at least two log files; wherein b. each log file of the at least two log files comprises at least one log entry with at least one time stamp and at least one message; wherein c. the at least two log files differ from one another with respect to at least one distinctive criterion; d. Extracting at least one piece of additional information from each log file of the at least two log files; and e. Combining each log file of the at least two log files with the extracted additional information into at least two processed log files; wherein f. the at least two processed log files comply with a coherent representation.
 2. The method according to claim 1, wherein the at least one distinctive criterion is selected from the group comprising: type, format and structure.
 3. The method according to claim 1, wherein the additional information is information selected from the group comprising: a computing unit which generated the log file, a program which generated the log file, configuration information of the computing unit which generated the log file, a log entry template and a connection between a log entry and the computing unit the log entry references.
 4. The method according to claim 1, wherein the coherent representation is an input for log mining or any other further analysis.
 5. The method according to claim 1, wherein the method comprises the further step of loading the coherent representation into a knowledge graph.
 6. A computer program product directly loadable into an internal memory of a computer, comprising software code portions for performing the steps according to claim 1 when said computer program product is running on a computer.
 7. A generating unit for performing the steps according to claim 1.